Explaining Models or Modelling Explanations

Challenging Existing Paradigms in Trustworthy AI

Delft University of Technology

Arie van Deursen
Cynthia C. S. Liem

May 1, 2025

Background

  • Question: How can we explain predictions of opaque machine learning models and make them “explain themselves better”?
  • Methods: Counterfactual Explanations, Algorithmic Recourse, Probabilistic Machine Learning, Energy-Based Models, Conformal Prediction, …
  • Applications: Mostly finance and economics but also images and natural language.
  • Tools: I am a Julia developer and the founder of Taija, an organization for trustworthy AI in Julia.

Questions

  • What are counterfactual explanations (CE) and algorithmic recourse (AR) and why are they useful?
  • What dynamics are generated when off-the-shelf solutions to CE and AR are implemented in practice?
  • Can we generate plausible counterfactuals relying only on the opaque model itself?
  • How can we leverage counterfactuals during training to build more trustworthy models?

Counterfactual Explanations

\[ \begin{aligned} \min_{\mathbf{Z}^\prime \in \mathcal{Z}^L} \{ {\text{yloss}(M_{\theta}(f(\mathbf{Z}^\prime)),\mathbf{y}^+)} + \lambda {\text{cost}(f(\mathbf{Z}^\prime)) } \} \end{aligned} \]

Counterfactual Explanations (CE) explain how inputs into a model need to change for it to produce different outputs.

📜 Altmeyer, van Deursen, and Liem (2023) @ JuliaCon 2022.

Figure 1: Gradient-based counterfactual search.
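The gradient-based search can be sketched in a few lines. The snippet below is a minimal toy illustration in Python (not the Taija implementation): a fixed logistic classifier `predict` stands in for the opaque model \(M_{\theta}\), both losses are squared distances for simplicity, and all names (`w`, `b`, `counterfactual`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical opaque model: a fixed logistic classifier M_theta(x) = sigmoid(w.x + b).
w = np.array([2.0, -1.0])
b = 0.0

def predict(x):
    return sigmoid(w @ x + b)

def counterfactual(x, y_target=1.0, lam=0.1, lr=0.5, steps=200):
    """Gradient-based counterfactual search in the spirit of Wachter et al. (2017):
    minimise yloss(M(x'), y+) + lam * cost(x'), here with squared losses."""
    z = x.copy()
    for _ in range(steps):
        p = predict(z)
        # gradient of (p - y_target)^2, chaining through the sigmoid
        grad_yloss = 2 * (p - y_target) * p * (1 - p) * w
        grad_cost = 2 * lam * (z - x)  # cost = squared distance to the factual
        z -= lr * (grad_yloss + grad_cost)
    return z

x_factual = np.array([-1.0, 1.0])  # classified negative: predict(x_factual) < 0.5
x_cf = counterfactual(x_factual)   # crosses the decision boundary, staying close to x
```

The trade-off parameter `lam` controls how far the counterfactual may stray from the factual: larger values keep it closer but may prevent the prediction from flipping.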

Algorithmic Recourse

Provided a CE is valid, plausible, and actionable, it can be used to provide recourse to individuals negatively affected by models.

Figure 2: Counterfactuals for random samples from the Give Me Some Credit dataset (Kaggle 2011). Features ‘age’ and ‘income’ are shown.

Hidden costs of implausible counterfactuals …

Dynamics of Counterfactuals

📜 Altmeyer, Angela, et al. (2023) @ SaTML 2023.

Figure 3: Illustration of external cost of individual recourse.

Plausibility at all cost?

Pick your Poison

All of these counterfactuals are valid explanations for the model’s prediction.

Which one would you pick?

Figure 4: Turning a 9 into a 7: Counterfactual explanations for an image classifier produced using Wachter (Wachter, Mittelstadt, and Russell 2017), Schut (Schut et al. 2021) and REVISE (Joshi et al. 2019).

ECCCos from the Black-Box

📜 Altmeyer, Farmanbar, et al. (2023) @ AAAI 2024

Key Idea

Use the hybrid objective of joint energy models (JEM; Grathwohl et al. 2020) and a model-agnostic, conformal-prediction-based penalty for predictive uncertainty (Stutz et al. 2022): Energy-Constrained (\(\mathcal{E}_{\theta}\)) Conformal (\(\Omega\)) Counterfactuals (ECCCo).

ECCCo objective:

\[ \begin{aligned} & \min_{\mathbf{Z}^\prime \in \mathcal{Z}^L} \{ {L_{\text{clf}}(f(\mathbf{Z}^\prime);M_{\theta},\mathbf{y}^+)}+ \lambda_1 {\text{cost}(f(\mathbf{Z}^\prime)) } \\ &+ \lambda_2 \mathcal{E}_{\theta}(f(\mathbf{Z}^\prime)|\mathbf{y}^+) + \lambda_3 \Omega(C_{\theta}(f(\mathbf{Z}^\prime);\alpha)) \} \end{aligned} \]
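To make the four terms of this objective concrete, here is a heavily simplified Python sketch, not the paper's implementation: a toy two-class logistic model, a class-conditional "energy" given by squared distance to a class prototype as a stand-in for a JEM's \(\mathcal{E}_{\theta}\), and a capped-probability proxy for the conformal set size \(\Omega\). All names and the specific stand-in penalties are assumptions for illustration only.

```python
import numpy as np

# Toy stand-ins: a fixed logistic model and one prototype per class.
w, b = np.array([2.0, -1.0]), 0.0
prototypes = {0: np.array([-1.0, 1.0]), 1: np.array([1.0, -1.0])}

def probs(x):
    p1 = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return np.array([1.0 - p1, p1])

def eccco_loss(z, x, y_plus, lams=(0.1, 0.05, 0.05)):
    l1, l2, l3 = lams
    clf = (probs(z)[y_plus] - 1.0) ** 2                   # L_clf: push towards target class
    cost = l1 * np.sum((z - x) ** 2)                      # proximity to the factual
    energy = l2 * np.sum((z - prototypes[y_plus]) ** 2)   # toy stand-in for E_theta(z | y+)
    omega = l3 * np.sum(np.minimum(probs(z) / 0.5, 1.0))  # toy proxy for set size Omega
    return clf + cost + energy + omega

def grad(f, z, eps=1e-5):
    """Central finite differences, to keep the sketch dependency-free."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z); e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

x = np.array([-1.0, 1.0])  # factual, predicted class 0
z = x.copy()
for _ in range(300):       # plain gradient descent on the combined objective
    z -= 0.2 * grad(lambda v: eccco_loss(v, x, 1), z)
```

The energy term pulls the counterfactual towards regions the model deems likely for the target class, while the set-size proxy (close to 2 for ambivalent predictions, close to 1 for confident ones) penalises predictive uncertainty.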

Figure 5: Gradient fields and counterfactual paths for different generators.

Faithful Counterfactuals

Figure 6: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

ECCCo generates counterfactuals that

  • faithfully represent model quality (Figure 6).
  • achieve state-of-the-art plausibility (Figure 7).

Figure 7: Results for different generators (from 3 to 5).

Teaching models plausible explanations

Counterfactual Training: Method

Idea

Let the model compare its own explanations to plausible ones.

  • Generate faithful counterfactuals on the fly.
  • Bonus: get adversarial examples for free.
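One way such a loop could look is sketched below; this is an illustrative reading of the idea, not the paper's algorithm. A toy logistic model is trained on two Gaussian blobs, counterfactuals are generated against the current model on the fly, and those that land far from the target class's data (here crudely judged by distance to the class mean, a hypothetical plausibility check) are treated as adversarial examples and trained with their factual label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs, one per class.
X0 = rng.normal([-2, 0], 0.5, size=(50, 2))
X1 = rng.normal([2, 0], 0.5, size=(50, 2))
X = np.vstack([X0, X1]); y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2); b = 0.0

def p1(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fit_step(Xb, yb, lr=0.1):
    global w, b
    err = p1(Xb) - yb                    # logistic-regression gradient step
    w -= lr * Xb.T @ err / len(yb)
    b -= lr * err.mean()

def counterfactual(x, steps=50, lr=0.5):
    """Flip the current model's prediction by gradient steps on the input."""
    z = x.copy()
    target = 0.0 if p1(x) >= 0.5 else 1.0
    for _ in range(steps):
        pr = p1(z)
        z -= lr * 2 * (pr - target) * pr * (1 - pr) * w
    return z

for epoch in range(200):
    fit_step(X, y)
    if epoch % 20 == 0 and np.abs(w).sum() > 0:
        # On-the-fly counterfactual: if it is implausible (far from the target
        # class's data), treat it as an adversarial example with the factual label.
        i = rng.integers(len(X))
        z = counterfactual(X[i])
        target = 1 - y[i]
        if np.linalg.norm(z - X[y == target].mean(axis=0)) > 2.0:
            fit_step(z[None, :], np.array([y[i]]))
```

The same generation step thus serves double duty: plausible counterfactuals act as explanations, implausible ones as free adversarial training data.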

Explanation or attack?

Counterfactual Training: Results

Illustration of how CT improves model explainability: (a) conventional training, all mutable; (b) CT, all mutable; (c) conventional, age immutable; (d) CT, age immutable.

  • Models trained with CT learn more plausible and (provably) actionable explanations.
  • Predictive performance does not suffer, and robustness improves.

If we still have time …

Spurious Sparks of AGI

📜 In Altmeyer et al. (2024) @ ICML 2024, we challenge the idea that finding meaningful patterns in the latent spaces of large models is indicative of AGI.

Figure 8: Inflation of prices or birds? It doesn’t matter!

Taija

  • Work presented @ JuliaCon 2022, 2023, 2024.
  • Running project @ Google Summer of Code 2024.
  • Total of three software projects @ TU Delft.

Trustworthy AI in Julia: github.com/JuliaTrustworthyAI

References

Altmeyer, Patrick, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen, and Cynthia CS Liem. 2023. “Endogenous Macrodynamics in Algorithmic Recourse.” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 418–31. IEEE.
Altmeyer, Patrick, Andrew M. Demetriou, Antony Bartlett, and Cynthia C. S. Liem. 2024. “Position Paper: Against Spurious Sparks-Dovelating Inflated AI Claims.” https://arxiv.org/abs/2402.03962.
Altmeyer, Patrick, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Explaining Black-Box Models through Counterfactuals.” In Proceedings of the JuliaCon Conferences, 1:130.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Faithful Model Explanations Through Energy-Constrained Conformal Counterfactuals.” https://arxiv.org/abs/2312.10648.
Grathwohl, Will, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. 2020. “Your Classifier Is Secretly an Energy Based Model and You Should Treat It Like One.” In International Conference on Learning Representations.
Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” https://arxiv.org/abs/1907.09615.
Kaggle. 2011. “Give Me Some Credit: Improve on the State of the Art in Credit Scoring by Predicting the Probability That Somebody Will Experience Financial Distress in the Next Two Years.” https://www.kaggle.com/c/GiveMeSomeCredit.
Schut, Lisa, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In International Conference on Artificial Intelligence and Statistics, 1756–64. PMLR.
Stutz, David, Krishnamurthy Dvijotham, Ali Taylan Cemgil, and Arnaud Doucet. 2022. “Learning Optimal Conformal Classifiers.” https://arxiv.org/abs/2110.09192.
Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harv. JL & Tech. 31: 841. https://doi.org/10.2139/ssrn.3063289.